Abstract

Two important aspects of semantic parsing for question answering are the
breadth of the knowledge source and the depth of logical compositionality.
While existing work trades off one aspect for another, this paper
simultaneously makes progress on both fronts through a new task: answering
complex questions on semi-structured tables using question-answer pairs as
supervision. The central challenge arises from two compounding factors: the
broader domain results in an open-ended set of relations, and the deeper
compositionality results in a combinatorial explosion in the space of logical
forms. We propose a logical-form driven parsing algorithm guided by strong
typing constraints and show that it obtains significant improvements over
natural baselines. For evaluation, we created a new dataset of 22,033 complex
questions on Wikipedia tables, which is made publicly available.


1 Introduction 


In semantic parsing for question answering, natural language questions are
converted into logical forms, which can be executed on a knowledge source to
obtain answer denotations. Early semantic parsing systems were trained to
answer highly compositional questions, but the knowledge sources were limited
to small closed-domain databases (Zelle and Mooney, 1996; Wong and Mooney,
2007; Zettlemoyer and Collins, 2007; Kwiatkowski et al., 2011). More recent
work sacrifices compositionality in favor of using more open-ended knowledge
bases such as Freebase (Cai and Yates, 2013; Berant et al., 2013; Fader et
al., 2014; Reddy et al., 2014). However, even these broader knowledge sources
still define a
$x_1$: “Greece held its last Summer Olympics in which year?”
$y_1$: {2004}

$x_2$: “In which city’s the first time with at least 20 nations?”
$y_2$: {Paris}

$x_3$: “Which years have the most participating countries?”
$y_3$: {2008, 2012}

$x_4$: “How many events were in Athens, Greece?”
$y_4$: {2}

$x_5$: “How many more participants were there in 1900 than in the first year?”
$y_5$: {10}

Figure 1: Our task is to answer a highly compositional question from an HTML
table. We learn a semantic parser from question-table-answer triples
$\{(x_i, t_i, y_i)\}$.


rigid schema over entities and relation types, thus 
restricting the scope of answerable questions. 

To simultaneously increase both the breadth of the knowledge source and the
depth of logical compositionality, we propose a new task (with an associated
dataset): answering a question using an HTML table as the knowledge source.
Figure 1 shows several question-answer pairs and an accompanying table, which
are typical of those in our dataset. Note that the questions are logically
quite complex, involving a variety of operations such as comparison ($x_2$),
superlatives ($x_3$), aggregation ($x_4$), and arithmetic ($x_5$).

The HTML tables are semi-structured and not normalized. For example, a cell
might contain multiple parts (e.g., “Beijing, China” or “200 km”).
Additionally, we mandate that the training and test tables are disjoint, so at
test time, we will see relations (column headers; e.g., “Nations”) and
entities (table cells; e.g., “St. Louis”) that were not observed during
training. This is in contrast to knowledge bases like Freebase, which have a
global fixed relation schema with normalized entities and relations.

Our task setting produces two main challenges. First, the increased breadth in
the knowledge source requires us to generate logical forms from novel tables
with previously unseen relations and entities. We therefore cannot follow the
typical semantic parsing strategy of constructing or learning a lexicon that
maps phrases to relations ahead of time. Second, the increased depth in
compositionality and additional logical operations exacerbate the exponential
growth of the number of possible logical forms.

We trained a semantic parser for this task from question-answer pairs based on
the framework illustrated in Figure 2. First, relations and entities from the
semi-structured HTML table are encoded in a graph. Then, the system parses the
question into candidate logical forms with a high-coverage grammar, reranks
the candidates with a log-linear model, and then executes the highest-scoring
logical form to produce the answer denotation. We use beam search with pruning
strategies based on type and denotation constraints to control the
combinatorial explosion.

To evaluate the system, we created a new dataset, WikiTableQuestions,
consisting of 2,108 HTML tables from Wikipedia and 22,033 question-answer
pairs. When tested on unseen tables, the system achieves an accuracy of 37.1%,
which is significantly higher than the information retrieval baseline of 12.7%
and a simple semantic parsing baseline of 24.3%.

2 Task 

Our task is as follows: given a table t and a question x about the table,
output a list of values y that answers the question according to the table.
Example inputs and outputs are shown in Figure 1. The system has access to a
training set $\mathcal{D} = \{(x_i, t_i, y_i)\}_{i=1}^{N}$ of questions,
tables, and answers, but the tables in test data do not appear during
training.

The only restriction on the question x is that a person must be able to answer
it using just the table t. Other than that, the question can be of any type,
ranging from a simple table lookup question to a more complicated one that
involves various logical operations.



Figure 2: The prediction framework: (1) the table t is deterministically
converted into a knowledge graph w as shown in Figure 3; (2) with information
from w, the question x is parsed into candidate logical forms in $Z_x$; (3)
the highest-scoring candidate $z \in Z_x$ is chosen; and (4) z is executed on
w, yielding the answer y.

Dataset. We created a new dataset, WikiTableQuestions, of question-answer
pairs on HTML tables as follows. We randomly selected data tables from
Wikipedia with at least 8 rows and 5 columns. We then created two Amazon
Mechanical Turk tasks. The first task asks workers to write trivia questions
about the table. For each question, we put one of the 36 generic prompts such
as “The question should require calculation” or “contains the word ‘first’ or
its synonym” to encourage more complex utterances. Next, we submit the
resulting questions to the second task where the workers answer each question
based on the given table. We only keep the answers that are agreed upon by at
least two workers. After this filtering, approximately 69% of the questions
remain.

The final dataset contains 22,033 examples on 2,108 tables. We set aside 20%
of the tables and their associated questions as the test set and develop on
the remaining examples. Simple preprocessing was done on the tables: we omit
all non-textual contents of the tables, and if there is a merged cell spanning
many rows or columns, we unmerge it and duplicate its content into each
unmerged cell. Section 7.2 analyzes various aspects of the dataset and
compares it to other datasets.

3 Approach 

We now describe our semantic parsing framework 
for answering a given question and for training the 
model with question-answer pairs. 


















Prediction. Given a table t and a question x, we predict an answer y using the
framework illustrated in Figure 2. We first convert the table t into a
knowledge graph w (“world”) which encodes different relations in the table
(Section 4). Next, we generate a set of candidate logical forms $Z_x$ by
parsing the question x using the information from w (Section 6.1). Each
generated logical form $z \in Z_x$ is a graph query that can be executed on
the knowledge graph w to get a denotation $\llbracket z \rrbracket_w$. We
extract a feature vector $\phi(x, w, z)$ for each $z \in Z_x$ (Section 6.2)
and define a log-linear distribution over the candidates:

$$p_\theta(z \mid x, w) \propto \exp\{\theta^\top \phi(x, w, z)\}, \qquad (1)$$
Figure 3: Part of the knowledge graph corresponding to the table in Figure 1.
Circular nodes are row nodes. We augment the graph with different entity
normalization nodes such as Number and Date (red) and additional row node
relations Next and Index (blue).


where $\theta$ is the parameter vector. Finally, we choose the logical form z
with the highest model probability and execute it on w to get the answer
denotation $y = \llbracket z \rrbracket_w$.

Training. Given training examples
$\mathcal{D} = \{(x_i, t_i, y_i)\}_{i=1}^{N}$, we seek a parameter vector
$\theta$ that maximizes the regularized log-likelihood of the correct
denotation $y_i$ marginalized over logical forms z. Formally, we maximize the
objective function

$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i, w_i) - \lambda \lVert \theta \rVert_1, \qquad (2)$$

where $w_i$ is deterministically generated from $t_i$, and

$$p_\theta(y \mid x, w) = \sum_{z \in Z_x;\; y = \llbracket z \rrbracket_w} p_\theta(z \mid x, w). \qquad (3)$$

We optimize $\theta$ using AdaGrad (Duchi et al., 2010), running 3 passes over
the data. We use $L_1$ regularization with $\lambda = 3 \times 10^{-5}$
obtained from cross-validation.
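
As a rough illustration of Equations (1) to (3), the sketch below computes the candidate distribution, the marginal log-likelihood of a denotation, and its gradient for a single example. It is a minimal sketch, not the authors' implementation: the sparse-dictionary feature representation and the function names are assumptions, and the $L_1$ term and the AdaGrad update are left out.

```python
import math
from collections import defaultdict


def candidate_probs(theta, feats):
    """Eq. (1): p(z | x, w) proportional to exp(theta . phi(x, w, z)).

    theta: dict feature name -> weight (hypothetical representation).
    feats: one sparse feature dict per candidate logical form.
    """
    scores = [sum(theta.get(f, 0.0) * v for f, v in phi.items()) for phi in feats]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def marginal_log_likelihood(theta, feats, denotations, y):
    """Log of Eq. (3): sum p(z | x, w) over candidates that execute to y."""
    probs = candidate_probs(theta, feats)
    p_y = sum(p for p, d in zip(probs, denotations) if d == y)
    return math.log(p_y) if p_y > 0 else float("-inf")


def gradient(theta, feats, denotations, y):
    """Gradient of log p(y | x, w): E[phi | y] - E[phi] under the model."""
    probs = candidate_probs(theta, feats)
    p_y = sum(p for p, d in zip(probs, denotations) if d == y)
    grad = defaultdict(float)
    if p_y == 0:
        return grad  # no candidate reaches y, so there is no signal
    for p, phi, d in zip(probs, feats, denotations):
        weight = (p / p_y if d == y else 0.0) - p
        for f, v in phi.items():
            grad[f] += weight * v
    return grad
```

With uniform weights and two candidates of which one executes to the target answer, the marginal probability is 0.5, and the gradient pushes up the features of the consistent candidate and pushes down the others.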

The following sections explain individual system components in more detail.


4 Knowledge graph 

Inspired by the graph representation of knowledge bases, we preprocess the
table t by deterministically converting it into a knowledge graph w as
illustrated in Figure 3. In the most basic form, table rows become row nodes,
strings in table cells become entity nodes,¹ and table columns become directed
edges from the row nodes to the entity nodes of that column. The column
headers are used as edge labels for these row-entity relations.

¹ Two occurrences of the same string constitute one node.

The knowledge graph representation is convenient for three reasons. First, we
can encode different forms of entity normalization in the graph. Some entity
strings (e.g., “1900”) can be interpreted as a number, a date, or a proper
name depending on the context, while some other strings (e.g., “200 km”) have
multiple parts. Instead of committing to one normalization scheme, we
introduce edges corresponding to different normalization methods from the
entity nodes. For example, the node 1900 will have an edge called Date to
another node 1900-XX-XX of type date. Apart from type checking, these
normalization nodes also aid learning by providing signals on the appropriate
answer type. For instance, we can define a feature that associates the phrase
“how many” with a logical form that says “traverse a row-entity edge, then a
Number edge” instead of just “traverse a row-entity edge.”

The second benefit of the graph representation is its ability to handle
various logical phenomena via graph augmentation. For example, to answer
questions of the form “What is the next ...?” or “Who came before ...?”, we
augment each row node with an edge labeled Next pointing to the next row node,
after which the questions can be answered by traversing the Next edge. In this
work, we choose to add two special edges on each row node: the Next edge
mentioned above and an Index edge pointing to the row index number
(0, 1, 2, ...).

Finally, with a graph representation, we can 
query it directly using a logical formalism for 
knowledge graphs, which we turn to next. 
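
The conversion described in this section can be sketched as follows. This is a minimal sketch under stated assumptions: the triple encoding, the node tags ("row", "entity", "num", "date"), and the regular-expression normalizers are hypothetical stand-ins, since the text does not specify how normalization is implemented.

```python
import re


def build_graph(header, rows):
    """Convert a table into a set of (source, edge_label, target) triples.

    Nodes are tagged tuples; the tags are illustrative, not from the paper.
    """
    edges = set()
    for i, row in enumerate(rows):
        row_node = ("row", i)
        # Special edges added to every row node.
        edges.add((row_node, "Index", ("num", i)))
        if i + 1 < len(rows):
            edges.add((row_node, "Next", ("row", i + 1)))
        for column, cell in zip(header, row):
            # Identical strings map to the same entity node.
            entity = ("entity", cell)
            edges.add((row_node, column, entity))
            # Keep several normalizations side by side instead of picking one.
            if re.fullmatch(r"\d+", cell):
                edges.add((entity, "Number", ("num", int(cell))))
            if re.fullmatch(r"\d{4}", cell):  # crude year detector (assumption)
                edges.add((entity, "Date", ("date", cell + "-XX-XX")))
    return edges
```

Because the graph is a set of triples, two cells holding the same string share one entity node automatically, and each normalization appears as an extra outgoing edge rather than a replacement of the original string.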



















Name           Example
Join           City.Athens
               (row nodes with a City edge to Athens)
Union          City.(Athens ⊔ Beijing)
Intersection   City.Athens ⊓ Year.Number.<.1990
Reverse        R[Year].City.Athens
               (entities where a row in City.Athens has a Year edge to)
Aggregation    count(City.Athens)
               (the number of rows with city Athens)
Superlative    argmax(City.Athens, Index)
               (the last row with city Athens)
Arithmetic     sub(204, 201)  (= 204 − 201)
Lambda         λx[Year.Date.x]
               (a binary: composition of two relations)

Table 1: The lambda DCS operations we use.


5 Logical forms 


Rule                   Semantics                          Example

Anchored to the utterance
TokenSpan → Entity     match(z1)                          Greece
                       (match(s) = entity with name s)    anchored to “Greece”
TokenSpan → Atomic     val(z1)                            2012-07-XX
                       (val(s) = interpreted value)       anchored to “July 2012”

Unanchored (floating)
∅ → Relation           r                                  Country
                       (r = row-entity relation)
∅ → Relation           λx[r.p.x]                          λx[Year.Date.x]
                       (p = normalization relation)
∅ → Records            Type.Row                           (list of all rows)
∅ → RecordFn           Index                              (row ← row index)

Table 2: Base deduction rules. Entities and atomic values (e.g., numbers,
dates) are anchored to token spans, while other predicates are kept floating.
(a ← b represents a binary mapping b to a.)


As our language for logical forms, we use lambda dependency-based
compositional semantics (Liang, 2013), or lambda DCS, which we briefly
describe here. Each lambda DCS logical form is either a unary (denoting a list
of values) or a binary (denoting a list of pairs). The most basic unaries are
singletons (e.g., China represents an entity node, and 30 represents a single
number), while the most basic binaries are relations (e.g., City maps rows to
city entities, Next maps rows to rows, and >= maps numbers to numbers).
Logical forms can be combined into larger ones via various operations listed
in Table 1. Each operation produces a unary except lambda abstraction:
λx[f(x)] is a binary mapping x to f(x).
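
To make the operations in Table 1 concrete, here is a minimal sketch that executes a few of them on a toy graph. The three-row table, the node names, and the set-based encoding of unaries are assumptions for illustration, not the table of Figure 1 and not the authors' executor.

```python
# Toy knowledge graph as (source, relation, target) triples (hypothetical data).
EDGES = {
    ("r0", "City", "Athens"), ("r0", "Year", "1896"), ("r0", "Index", 0),
    ("r1", "City", "Paris"),  ("r1", "Year", "1900"), ("r1", "Index", 1),
    ("r2", "City", "Athens"), ("r2", "Year", "2004"), ("r2", "Index", 2),
}


def join(rel, values):
    """rel.values, e.g. City.Athens: nodes with a `rel` edge into `values`."""
    return {s for (s, r, t) in EDGES if r == rel and t in values}


def reverse_join(rel, nodes):
    """R[rel].nodes, e.g. R[Year].City.Athens: targets of `rel` edges from `nodes`."""
    return {t for (s, r, t) in EDGES if r == rel and s in nodes}


def count(values):
    """Aggregation: a unary holding the size of `values`."""
    return {len(values)}


def argmax(records, rel):
    """Superlative: the records whose `rel` value is largest."""
    scored = {}
    for rec in records:
        vals = reverse_join(rel, {rec})
        if vals:
            scored[rec] = max(vals)
    if not scored:
        return set()
    best = max(scored.values())
    return {rec for rec, v in scored.items() if v == best}


# R[Year].argmax(City.Athens, Index): year of the last row with city Athens.
last_athens_year = reverse_join("Year", argmax(join("City", {"Athens"}), "Index"))
```

Union and intersection map directly onto Python's set operators `|` and `&` in this encoding, which is why they are not defined separately.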


6 Parsing and ranking 

Given the knowledge graph w, we now describe how to parse the utterance x into
a set of candidate logical forms $Z_x$.

6.1 Parsing algorithm 

We propose a new floating parser which is more 
flexible than a standard chart parser. Both parsers 
recursively build up derivations and corresponding 
logical forms by repeatedly applying deduction 
rules, but the floating parser allows logical form 
predicates to be generated independently from the 
utterance. 

Chart parser. We briefly review the CKY algorithm for chart parsing to
introduce notation. Given an utterance with tokens $x_1, \ldots, x_n$, the CKY
algorithm applies deduction rules of the following two kinds:

$$(\text{TokenSpan}, i, j)[s] \to (c, i, j)[f(s)], \qquad (4)$$

$$(c_1, i, k)[z_1] + (c_2, k+1, j)[z_2] \to (c, i, j)[f(z_1, z_2)]. \qquad (5)$$

The first rule is a lexical rule that matches an utterance token span
$x_i \cdots x_j$ (e.g., s = “New York”) and produces a logical form (e.g.,
f(s) = NewYorkCity) with category c (e.g., Entity). The second rule takes two
adjacent spans giving rise to logical forms $z_1$ and $z_2$ and builds a new
logical form $f(z_1, z_2)$. Algorithmically, CKY stores derivations of
category c covering the span $x_i \cdots x_j$ in a cell (c, i, j). CKY fills
in the cells of increasing span lengths, and the logical forms in the top cell
(ROOT, 1, n) are returned.

Floating parser. Chart parsing uses lexical rules (4) to generate relevant
logical predicates, but in our setting of semantic parsing on tables, we do
not have the luxury of starting with or inducing a full-fledged lexicon.
Moreover, there is a mismatch between words in the utterance and predicates in
the logical form. For instance, consider the question “Greece held its last
Summer Olympics in which year?” on the table in Figure 1 and the correct
logical form

R[λx[Year.Date.x]].argmax(Country.Greece, Index).

While the entity Greece can be anchored to the token “Greece”, some logical
predicates (e.g., Country) cannot be clearly anchored to a token span. We
could potentially learn to anchor the logical form Country.Greece to “Greece”,
but if the relation Country is not seen during training, such a mapping is
impossible to learn from the







Rule                                  Semantics              Example

Join + Aggregate
Entity or Atomic → Values             z1                     China
Atomic → Values                       c.z1                   >=.30 (at least 30)
                                      (c ∈ {<, >, <=, >=})
Relation + Values → Records           z1.z2                  Country.China (events (rows) where the country is China)
Relation + Records → Values           R[z1].z2               R[Year].Country.China (years of events in China)
Records → Records                     Next.z1                Next.Country.China (... before China)
Records → Records                     R[Next].z1             R[Next].Country.China (... after China)
Values → Atomic                       a(z1)                  count(Country.China) (How often did China ...)
                                      (a ∈ {count, max, min, sum, avg})
Values → ROOT                         z1

Superlative
Relation → RecordFn                   z1                     λx[Nations.Number.x] (row ← value in Nations column)
Records + RecordFn → Records          s(z1, z2)              argmax(Type.Row, λx[Nations.Number.x])
                                      (s ∈ {argmax, argmin}) (events with the most participating nations)
                                                             argmin(City.Athens, Index) (first event in Athens)
Relation → ValueFn                    R[λx[a(z1.x)]]         R[λx[count(City.x)]] (city ← num. of rows with that city)
Relation + Relation → ValueFn         λx[R[z1].z2.x]         λx[R[City].Nations.Number.x] (city ← value in Nations column)
Values + ValueFn → Values             s(z1, z2)              argmax(..., R[λx[count(City.x)]]) (most frequent city)

Other operations
ValueFn + Values + Values → Values    o(R[z1].z2, R[z1].z3)  sub(R[Number].R[Nations].City.London, ...)
                                      (o ∈ {add, sub, mul, div}) (How many more participants were in London than ...)
Entity + Entity → Values              z1 ⊔ z2                China ⊔ France (China or France)
Records + Records → Records           z1 ⊓ z2                City.Beijing ⊓ Country.China (... in Beijing, China)

Table 3: Compositional deduction rules. Each rule $c_1, \ldots, c_k \to c$
takes logical forms $z_1, \ldots, z_k$ constructed over categories
$c_1, \ldots, c_k$, respectively, and produces a logical form based on the
semantics.


training data. Similarly, some prominent tokens 
(e.g., “Olympics”) are irrelevant and have no 
predicates anchored to them. 

Therefore, instead of anchoring each predicate in the logical form to tokens
in the utterance via lexical rules, we propose parsing more freely. We replace
the anchored cells (c, i, j) with floating cells (c, s) of category c and
logical form size s. Then we apply rules of the following three kinds:

$$(\text{TokenSpan}, i, j)[s] \to (c, 1)[f(s)], \qquad (6)$$

$$\varnothing \to (c, 1)[f()], \qquad (7)$$

$$(c_1, s_1)[z_1] + (c_2, s_2)[z_2] \to (c, s_1 + s_2 + 1)[f(z_1, z_2)]. \qquad (8)$$


(Values, 8)  R[λx[Year.Date.x]].argmax(Country.Greece, Index)
  (Relation, 1)  λx[Year.Date.x]
  (Records, 6)  argmax(Country.Greece, Index)
    (Records, 4)  Country.Greece
      (Relation, 1)  Country
      (Values, 2)  Greece
        (Entity, 1)  Greece
          (TokenSpan, 1, 1)  “Greece”
    (RecordFn, 1)  Index


Figure 4: A derivation for the utterance “Greece held its last Summer Olympics
in which year?” Only Greece is anchored to a phrase “Greece”; Year and other
predicates are floating.

Note that rules (6) are similar to (4) in chart parsing except that the
floating cell (c, 1) only keeps track of the category and its size 1, not the
span (i, j). Rules (7) allow us to construct predicates out of thin air. For
example, we can construct a logical form representing a table relation Country
in cell (Relation, 1) using the rule ∅ → Relation [Country] independent of the
utterance. Rules (8) perform composition, where the induction is on the size s
of the logical form rather than the span length. The algorithm stops when the
specified maximum size is reached, after which the logical forms in cells
(ROOT, s) for any s are included in $Z_x$. Figure 4 shows an example
derivation generated by our floating parser.
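
The size-indexed induction can be sketched with a heavily reduced rule set. This is a minimal sketch, not the authors' parser: logical forms are plain strings, only single-token entity anchoring is supported, plain relations stand in for the λx[...] forms, only four composition rules are included, and there is no typing, pruning, beam, or final ROOT step. The function name and arguments are hypothetical.

```python
from collections import defaultdict


def floating_parse(tokens, entities, relations, max_size=8):
    """Fill floating cells keyed by (category, size) with logical-form strings."""
    cells = defaultdict(set)

    # Rule (6): entities are anchored to tokens of the utterance.
    for token in tokens:
        if token in entities:
            cells[("Entity", 1)].add(token)

    # Rule (7): relations and Index are generated independently of the utterance.
    for rel in relations:
        cells[("Relation", 1)].add(rel)
    cells[("RecordFn", 1)].add("Index")

    # Rule (8): induction on logical form size; each rule application adds 1.
    for size in range(2, max_size + 1):
        # Unary rule, Entity -> Values: child of size - 1.
        for z in cells[("Entity", size - 1)]:
            cells[("Values", size)].add(z)
        # Binary rules: child sizes s1 + s2 = size - 1.
        for s1 in range(1, size - 1):
            s2 = size - 1 - s1
            for rel in cells[("Relation", s1)]:
                for values in cells[("Values", s2)]:
                    cells[("Records", size)].add(f"{rel}.{values}")
                for records in cells[("Records", s2)]:
                    cells[("Values", size)].add(f"R[{rel}].{records}")
            for records in cells[("Records", s1)]:
                for fn in cells[("RecordFn", s2)]:
                    cells[("Records", size)].add(f"argmax({records}, {fn})")
    return cells
```

Run on the utterance of Figure 4 with relations Country and Year, this reproduces the cell sizes of that derivation (2, 4, 6, 8), alongside nonsensical forms such as R[Country].Country.Greece that the typing and pruning constraints are meant to remove.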

The floating parser is very flexible: it can skip tokens and combine logical
forms in any order. This flexibility might seem too unconstrained, but we can
use strong typing constraints to prevent nonsensical derivations from being
constructed. Tables 2 and 3 show the full set of deduction






“Greece held its last Summer Olympics in which year?”
z = R[λx[Year.Number.x]].argmax(Type.Row, Index)
y = {2012} (type: Num, column: Year)

Feature Name                      Note
(“last”, predicate = argmax)      lex
phrase = predicate                unlex (∵ “year” = Year)
missing entity                    unlex (∵ missing Greece)
denotation type = Num
denotation column = Year
(“which year”, type = Num)        lex
phrase = column                   unlex (∵ “year” = Year)
(Q = “which”, type = Num)         lex
(H = “year”, type = Num)          lex
H = column                        unlex (∵ “year” = Year)

Table 4: Example features that fire for the (incorrect) logical form z. All
features are binary. (lex = lexicalized)


rules we use. We assume that all named entities will explicitly appear in the
question x, so we anchor all entity predicates (e.g., Greece) to token spans
(e.g., “Greece”). We also anchor all numerical values (numbers, dates,
percentages, etc.) detected by an NER system. In contrast, relations (e.g.,
Country) and operations (e.g., argmax) are kept floating since we want to
learn how they are expressed in language. Connections between phrases in x and
the generated relations and operations in z are established in the ranking
model through features.

6.2 Features 

We define features $\phi(x, w, z)$ for our log-linear model to capture the
relationship between the question x and the candidate z. Table 4 shows some
example features from each feature type. Most features are of the form
(f(x), g(z)) or (f(x), h(y)) where $y = \llbracket z \rrbracket_w$ is the
denotation, and f, g, and h extract some information (e.g., identity, POS
tags) from x, z, or y, respectively.

phrase-predicate: Conjunctions between n-grams f(x) from x and predicates g(z)
from z. We use both lexicalized features, where all possible pairs
(f(x), g(z)) form distinct features, and binary unlexicalized features
indicating whether f(x) and g(z) have a string match.

missing-predicate: Indicators on whether there are entities or relations
mentioned in x but not in z. These features are unlexicalized.

denotation: Size and type of the denotation $y = \llbracket z \rrbracket_w$.
The type can be either a primitive type (e.g., Num, Date, Entity) or the name
of the column containing the entity in y (e.g., City).

phrase-denotation: Conjunctions between n-grams from x and the types of y.
Similar to the phrase-predicate features, we use both lexicalized and
unlexicalized features.

headword-denotation: Conjunctions between the question word Q (e.g., what,
who, how many) or the headword H (the first noun after the question word) with
the types of y.
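
As an illustration of the first feature type, the sketch below extracts lexicalized and unlexicalized phrase-predicate features. The feature-name strings, the n-gram limit, and the case-insensitive exact match are assumptions; the text does not say how the string match is computed.

```python
def phrase_predicate_features(tokens, predicates, max_n=2):
    """Binary phrase-predicate features for one (question, logical form) pair.

    tokens: question tokens; predicates: predicate names used in the
    candidate logical form. Both representations are hypothetical.
    """
    feats = {}
    ngrams = [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]
    for gram in ngrams:
        for pred in predicates:
            # Lexicalized: every (n-gram, predicate) pair is its own feature.
            feats[f"lex:{gram}|{pred}"] = 1.0
            # Unlexicalized: one shared feature that fires on a string match.
            if gram.lower() == pred.lower():
                feats["unlex:phrase=predicate"] = 1.0
    return feats
```

The unlexicalized feature is what lets the model generalize to column headers never seen in training: a match between “year” and Year fires the same feature as a match between any other phrase and its column.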

6.3 Generation and pruning 

Due to their recursive nature, the rules allow us to generate highly
compositional logical forms. However, the compositionality comes at the cost
of generating exponentially many logical forms, most of which are redundant
(e.g., logical forms with an argmax operation on a set of size 1). We employ
several methods to deal with this combinatorial explosion:

Beam search. We compute the model probability of each partial logical form
based on available features (i.e., features that do not depend on the final
denotation) and keep only the K = 200 highest-scoring logical forms in each
cell.

Pruning. We prune partial logical forms that lead to invalid or redundant
final logical forms. For example, we eliminate any logical form that does not
type check (e.g., Beijing ⊔ Greece), executes to an empty list (e.g.,
Year.Number.24), includes an aggregate or superlative on a singleton set
(e.g., argmax(Year.Number.2012, Index)), or joins two relations that are the
reverses of each other (e.g., R[City].City.Beijing).
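
The interaction of pruning and the beam within one cell can be sketched as below. This is a minimal sketch under assumptions: `score` and `execute` are hypothetical callables supplied by the caller, a `TypeError` stands in for a failed type check, and only the type-check and empty-denotation filters are shown.

```python
import heapq


def prune_and_beam(candidates, score, execute, beam_size=200):
    """Drop invalid partial logical forms, then keep the top-scoring ones.

    candidates: iterable of logical forms for one cell.
    score: logical form -> model score from denotation-independent features.
    execute: logical form -> denotation; assumed to raise TypeError when the
             form does not type check (an illustrative convention).
    """
    kept = []
    for z in candidates:
        try:
            denotation = execute(z)
        except TypeError:
            continue  # does not type check
        if not denotation:
            continue  # executes to an empty list
        kept.append(z)
    # Beam: the K highest-scoring survivors, best first.
    return heapq.nlargest(beam_size, kept, key=score)
```

Pruning before ranking matters here: a form that scores well but executes to nothing would otherwise take a beam slot away from a valid candidate.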

7 Experiments 

7.1 Main evaluation 

We evaluate the system on the development sets 
(three random 80:20 splits of the training data) and 
the test data. In both settings, the tables we test on 
do not appear during training. 

Evaluation metrics. Our main metric is accuracy, which is the fraction of
examples (x, t, y) on which the system outputs the correct answer y. We also
report the oracle score, which is the fraction of examples where at least one
generated candidate $z \in Z_x$ executes to y.
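
The two metrics differ only in whether the top-ranked candidate or any candidate must match, as the following sketch shows. The input format (a gold denotation paired with a ranked list of candidate denotations) is an assumption for illustration.

```python
def accuracy_and_oracle(examples):
    """Return (accuracy, oracle) as fractions of the examples.

    examples: list of (gold, ranked) pairs, where `ranked` lists the
    denotations of the candidates, highest model score first.
    """
    total = len(examples)
    correct = sum(1 for gold, ranked in examples if ranked and ranked[0] == gold)
    reachable = sum(1 for gold, ranked in examples if gold in ranked)
    return correct / total, reachable / total
```

Oracle is always at least as large as accuracy, and the gap between them is what better ranking could in principle recover.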

Baselines. We compare the system to two baselines. The first baseline (IR),
which simulates information retrieval, selects an answer y among the entities
in the table using a log-linear model over entities (table cells) rather than
logical forms. The features are conjunctions between phrases in x and




                   dev            test
                acc    ora     acc    ora
IR baseline     13.4   69.1    12.7   70.6
WQ baseline     23.6   34.4    24.3   35.6
Our system      37.0   76.7    37.1   76.6

Table 5: Accuracy (acc) and oracle scores (ora) on the development sets (3
random splits of the training data) and the test data.


Operation                              Amount
join (table lookup)                     13.5%
+ join with Next                       + 5.5%
+ aggregate (count, sum, max, ...)     + 15.0%
+ superlative (argmax, argmin)         + 24.5%
+ arithmetic, ⊓, ⊔                     + 20.5%
+ other phenomena                      + 21.0%

Table 7: The logical operations required to answer the questions in 200 random
examples.



                                          acc    ora
Our system                                37.0   76.7

(a) Rule Ablation
join only                                 10.6   15.7
join + count (= WQ baseline)              23.6   34.4
join + count + superlative                30.7   68.6
all − {⊓, ⊔}                              34.8   75.1

(b) Feature Ablation
all − features involving predicate        11.8   74.5
  all − phrase-predicate                  16.9   74.5
    all − lex phrase-predicate            17.6   75.9
    all − unlex phrase-predicate          34.3   76.7
  all − missing-predicate                 35.9   76.7
all − features involving denotation       33.5   76.8
  all − denotation                        34.3   76.6
  all − phrase-denotation                 35.7   76.8
  all − headword-denotation               36.0   76.7

(c) Anchor operations to trigger words    37.1   59.4

Table 6: Average accuracy and oracle scores on development data in various
system settings.


properties of the answers y, which cover all features in our main system that
do not involve the logical form. As an upper bound of this baseline, 69.1% of
the development examples have the answer appearing as an entity in the table.

In the second baseline (WQ), we only allow deduction rules that produce join
and count logical forms. This rule subset has the same logical coverage as
Berant and Liang (2014), which is designed to handle the WebQuestions (Berant
et al., 2013) and Free917 (Cai and Yates, 2013) datasets.

Results. Table 5 shows the results compared to the baselines. Our system gets
an accuracy of 37.1% on the test data, which is significantly higher than both
baselines, while the oracle is 76.6%. The next subsections analyze the system
components in more detail.


7.2 Dataset statistics 

In this section, we analyze the breadth and depth 
of the WikiTableQuestions dataset, and how 
the system handles them. 

Number of relations. With 3,929 unique column headers (relations) among 13,396
columns, the tables in the WikiTableQuestions dataset contain many more
relations than closed-domain datasets such as Geoquery (Zelle and Mooney,
1996) and ATIS (Price, 1990). Additionally, the logical forms that execute to
the correct denotations refer to a total of 2,056 unique column headers, which
is greater than the number of relations in the Free917 dataset (635 Freebase
relations).


Knowledge coverage. We sampled 50 examples from the dataset and tried to
answer them manually using Freebase. Even though Freebase contains some
information extracted from Wikipedia, we can answer only 20% of the questions,
indicating that WikiTableQuestions contains a broad set of facts beyond
Freebase.

Logical operation coverage. The dataset covers a wide range of question types
and logical operations. Table 6(a) shows the drop in oracle scores when
different subsets of rules are used to generate candidate logical forms. The
join only subset corresponds to simple table lookup, while join + count is the
WQ baseline for Freebase question answering on the WebQuestions dataset.
Finally, join + count + superlative roughly corresponds to the coverage of the
Geoquery dataset.

To better understand the distribution of logical operations in the
WikiTableQuestions dataset, we manually classified 200 examples based on the
types of operations required to answer the question. The statistics in Table 7
show that while a few questions only require simple operations such as table
lookup, the majority of the questions demand more advanced operations.
Additionally, 21% of the examples cannot be answered using any logical form
generated from the current deduction rules; these examples are discussed in
Section 7.4.


Compositionality. From each example, we compute the logical form size (number
of rules applied) of the highest-scoring candidate that executes to the
correct denotation. The histogram in Figure 5 shows that a significant number
of logical forms are non-trivial.

Figure 5: Sizes of the highest-scoring correct candidate logical forms in
development examples.

Figure 6: Accuracy (solid red) and oracle (dashed blue) scores with different
beam sizes.

Beam size and pruning. Figure 6 shows the results with and without pruning on
various beam sizes. Apart from saving time, pruning also prevents bad logical
forms from clogging up the beam, which would hurt both the oracle and accuracy
metrics.


7.3 Features 

Effect of features. Table 6(b) shows the accuracy when some feature types are
ablated. The most influential features are lexicalized phrase-predicate
features, which capture the relationship between phrases and logical
operations (e.g., relating “last” to argmax) as well as between phrases and
relations (e.g., relating “before” to < or Next, and relating “who” to the
relation Name).

Anchoring with trigger words. In our parsing algorithm, relations and logical
operations are not anchored to the utterance. We consider an alternative
approach where logical operations are anchored to “trigger” phrases, which are
hand-coded based on co-occurrence statistics (e.g., we trigger a count logical
form with how, many, and total).

Table 6(c) shows that the trigger words do not significantly impact the
accuracy, suggesting that the original system is already able to learn the
relationship between phrases and operations even without a manual lexicon. As
an aside, the large drop in oracle occurs because fewer “semantically
incorrect” logical forms are generated; we discuss this phenomenon in the next
subsection.


7.4 Semantically correct logical forms 

In our setting, we face a new challenge that arises from learning with
denotations: with deeper compositionality, a larger number of nonsensical
logical forms can execute to the correct denotation. For example, if the
target answer is a small number (say, 2), it is possible to count the number
of rows with some random properties and arrive at the correct answer. However,
as the system encounters more examples, it can potentially learn to disfavor
them by recognizing the characteristics of semantically correct logical forms.

Generating semantically correct logical forms. The system can learn the
features of semantically correct logical forms only if it can generate them in
the first place. To see how well the system can generate correct logical
forms, looking at the oracle score is insufficient since bad logical forms can
execute to the correct denotations. Instead, we randomly chose 200 examples
and manually annotated them with logical forms to see if a trained system can
produce the annotated logical form as a candidate.

Out of 200 examples, we find that 79% can be manually annotated. The remaining
ones include artifacts such as unhandled question types (e.g., yes-no
questions, or questions with phrases “same” or “consecutive”), table cells
that require advanced normalization methods (e.g., cells with comma-separated
lists), and incorrect annotations.

The system generates the annotated logical
form among the candidates in 53.5% of the
examples. The missing examples are mostly caused
by anchoring errors due to lexical mismatch (e.g.,
“Italian” → Italy, or “no zip code” → an empty
cell in the zip code column) or the need to generate
complex logical forms from a single phrase (e.g.,
“May 2010” → >=.2010-05-01 ⊓ <=.2010-05-31).
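
To show what generating such a form from a single phrase would involve,
here is a minimal Python sketch, written under the assumption that the
phrase always has the shape "<English month name> <year>". The function
names month_phrase_to_range and to_logical_form are hypothetical, and the
output string only imitates the notation of the example above (the
intersection of two comparison constraints); it is not the paper's
implementation.

```python
import calendar
from datetime import date

# English month names, spelled out so the sketch does not depend on locale.
MONTHS = [
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
]


def month_phrase_to_range(phrase):
    """Map a phrase such as 'May 2010' to (first_day, last_day) dates."""
    month_name, year_text = phrase.split()
    month = MONTHS.index(month_name.lower()) + 1
    year = int(year_text)
    # monthrange returns (weekday of first day, number of days in month).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def to_logical_form(phrase):
    """Render the date range as two comparison constraints joined by ⊓."""
    start, end = month_phrase_to_range(phrase)
    return f">=.{start.isoformat()} ⊓ <=.{end.isoformat()}"


print(to_logical_form("May 2010"))
```

The point of the sketch is that one short phrase expands into two
comparison operators, two date literals, and an intersection, which is far
more structure than anchoring a phrase to a single table cell.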


7.5 Error analysis 


The errors on the development data can be divided
into four groups. The first two groups are
unhandled question types (21%) and the failure to
anchor entities (25%) as described in Section 7.4.
The third group is normalization and type errors
(29%): although we handle some forms of entity
normalization, we observe many unhandled
string formats such as times (e.g., 3:45.79) and
city-country pairs (e.g., Beijing, China), as well as
complex calculations such as computing time
periods (e.g., 12pm-1am → 1 hour). Finally, we have
ranking errors (25%) which mostly occur when the
utterance phrase and the relation are obliquely
related (e.g., “airplane” and Model).
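
For concreteness, the following Python sketch shows what normalizing two
of the string formats mentioned above could look like. It is only an
illustration under simple assumptions (colon-separated durations and a
single comma between city and country); the helpers
parse_duration_seconds and split_city_country are hypothetical and are not
part of the system evaluated here.

```python
def parse_duration_seconds(text):
    """Convert a duration such as '3:45.79' or '1:02:03' to seconds.

    Each colon-separated field is worth 60 times the field after it.
    """
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def split_city_country(text):
    """Split 'Beijing, China' into ('Beijing', 'China').

    Returns (text, None) when the cell contains no comma.
    """
    if "," not in text:
        return text.strip(), None
    city, country = text.split(",", 1)
    return city.strip(), country.strip()


print(parse_duration_seconds("3:45.79"))
print(split_city_country("Beijing, China"))
```

With such normalized values, a time cell could be compared numerically and
a city-country cell could be matched against either of its parts.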


8 Discussion 


Our work simultaneously increases the breadth of
the knowledge source and the depth of
compositionality in semantic parsing. This section
explores the connections to related work in both
aspects.

Logical coverage. Different semantic parsing
systems are designed to handle different sets of
logical operations and degrees of compositionality.
For example, form-filling systems (Wang et
al., 2011) usually cover a smaller scope of
operations and compositionality, while early
statistical semantic parsers for question answering
(Wong and Mooney, 2007; Zettlemoyer and Collins,
2007) and high-accuracy natural language
interfaces for databases (Androutsopoulos et al.,
1995; Popescu et al., 2003) target more
compositional utterances with a wide range of
logical operations. This work aims to increase the
logical coverage even further. For example,
compared to the Geoquery dataset, the
WikiTableQuestions dataset includes a more diverse
set of logical operations, and while it does not
have extremely compositional questions like those
in Geoquery (e.g., “What states border states that
border states that border Florida?”), our dataset
contains fairly compositional questions on average.

To parse a compositional utterance, many works
rely on a lexicon that translates phrases to
entities, relations, and logical operations. A
lexicon can be automatically generated (Unger and
Cimiano, 2011; Unger et al., 2012), learned from
data (Zettlemoyer and Collins, 2007; Kwiatkowski et
al., 2011), or extracted from external sources (Cai
and Yates, 2013; Berant et al., 2013), but requires
some techniques to generalize to unseen data. Our
work takes a different approach similar to the
logical form growing algorithm in Berant and Liang
(2014) by not anchoring relations and operations
to the utterance.

Knowledge domain. Recent works on semantic
parsing for question answering operate on more
open and diverse data domains. In particular,
large-scale knowledge bases have gained popularity
in the semantic parsing community (Cai and
Yates, 2013; Berant et al., 2013; Fader et al.,
2014). The increasing number of relations and
entities motivates new resources and techniques for
improving the accuracy, including the use of
ontology matching models (Kwiatkowski et al., 2013),
paraphrase models (Fader et al., 2013; Berant and
Liang, 2014), and unlabeled sentences
(Krishnamurthy and Kollar, 2013; Reddy et al., 2014).


Our work leverages open-ended data from the
Web through semi-structured tables. There have
been several studies on analyzing or inferring the
table schemas (Cafarella et al., 2008; Venetis et al.,
2011; Syed et al., 2010; Limaye et al., 2010) and
answering search queries by joining tables on
similar columns (Cafarella et al., 2008; Gonzalez et
al., 2010; Pimplikar and Sarawagi, 2012). While
the latter is similar to question answering, the
queries tend to be keyword lists instead of natural
language sentences. In parallel, open information
extraction (Wu and Weld, 2010; Masaum et al.,
2012) and knowledge base population (Ji and
Grishman, 2011) extract information from web pages
and compile it into structured data. The resulting
knowledge base is systematically organized,
but as a trade-off, some knowledge is inevitably
lost during extraction and the information is forced
to conform to a specific schema. To avoid these
issues, we choose to work on HTML tables directly.

In future work, we wish to draw information
from other semi-structured formats such as
colon-delimited pairs (Wong et al., 2009), bulleted
lists (Gupta and Sarawagi, 2009), and top-k lists
(Zhang et al., 2013). Pasupat and Liang (2014)
used a framework similar to ours to extract entities
from web pages, where the “logical forms” were
XPath expressions. A natural direction is to
combine the logical compositionality of this work
with the even broader knowledge source of general
web pages.

Data and reproducibility. The WikiTableQuestions
dataset can be downloaded at
http://nlp.stanford.edu/software/sempre/wikitable/.
Additionally, code, data, and experiments for
this paper are available on the CodaLab platform at
https://www.codalab.org/worksheets/0xf26cd79d4d734287868923ad1067cf4c/.